1. Introduction¶
Thanks to Heeral Dedhia for sharing the Groceries dataset. Kaggle link
This dataset contains information on supermarket purchases, including a personal identifier, date, and product. The analysis pursues two primary objectives: first, to evaluate the supermarket’s current performance and monitor sales trends over time; and second, to uncover significant patterns in customer behavior. The resulting insights are leveraged to develop data-driven strategies aimed at optimizing operational efficiency, enhancing the customer experience, and ultimately increasing profitability. All principal findings and strategic recommendations are presented in the final Executive Summary.
2. Load & Exploration¶
import pandas as pd
df = pd.read_csv('Groceries_dataset.csv')
print(f"Rows: {df.shape[0]} | Cols: {df.shape[1]}")
print(f"Nº nulls:{df.isnull().sum().sum()}")
df.head()
Rows: 38765 | Cols: 3
Nº nulls:0
| | Member_number | Date | itemDescription |
|---|---|---|---|
| 0 | 1808 | 21-07-2015 | tropical fruit |
| 1 | 2552 | 05-01-2015 | whole milk |
| 2 | 2300 | 19-09-2015 | pip fruit |
| 3 | 1187 | 12-12-2015 | other vegetables |
| 4 | 3037 | 01-02-2015 | whole milk |
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.dtypes
| | 0 |
|---|---|
| Member_number | int64 |
| Date | datetime64[ns] |
| itemDescription | object |
The dataset comprises shopping records from 3,898 customers. Each purchased item is recorded in a separate row, meaning that a single shopping trip may span multiple rows. Consequently, repeated purchases of the same product by the same customer on the same date may appear more than once.
print("Nº Clients:", df['Member_number'].nunique())
print("Nº Items:", df['itemDescription'].nunique())
print("The shopping list is expressed in different rows ->", df.duplicated(subset=['Date', 'Member_number']).any())
print("Even for the purchase of the same product ->", df.duplicated(subset=['Date', 'Member_number', 'itemDescription']).any())
Nº Clients: 3898
Nº Items: 167
The shopping list is expressed in different rows -> True
Even for the purchase of the same product -> True
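Since one shopping trip spans several rows, baskets can be reconstructed by grouping on customer and date. A minimal sketch on a few made-up rows that mimic the dataset's schema (the values below are illustrative, not taken from the file):

```python
import pandas as pd

# Hypothetical rows following the dataset's schema (Member_number, Date, itemDescription)
sample = pd.DataFrame({
    'Member_number': [1808, 1808, 1808, 2552],
    'Date': ['21-07-2015', '21-07-2015', '21-07-2015', '05-01-2015'],
    'itemDescription': ['tropical fruit', 'whole milk', 'whole milk', 'whole milk'],
})
# One basket per (customer, date); repeated products stay duplicated
baskets = (sample.groupby(['Member_number', 'Date'])['itemDescription']
                 .apply(list)
                 .reset_index(name='basket'))
print(baskets)
```

Note how the duplicated 'whole milk' survives inside the first basket, which is exactly the duplication flagged by the check above.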
2.1 Sale of products¶
See historical sales performance (total items purchased per month).
import plotly.express as px
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
monthly_sales = df.groupby(df['Date'].dt.to_period('M')).size().reset_index(name='Items Purchased')
monthly_sales['Date'] = monthly_sales['Date'].dt.to_timestamp()
fig = px.line(monthly_sales, x='Date', y='Items Purchased', title='Monthly Sales | Figure 1', line_shape='spline', markers=True)
iplot(fig)
The period analysed covers January 2014 to December 2015. Clearly, sales in 2014 were significantly lower, reaching their lowest point in February-March 2014. This was followed by a period of low sales that continued until December 2014. From then on, in January 2015, sales rose significantly, growing markedly compared to the previous year and peaking in August 2015.
See the best-selling products.
top_items = df['itemDescription'].value_counts().head(25).reset_index()
fig = px.bar(top_items, x='itemDescription', y='count',
title='Top 25 Items Purchased | Figure 2',
labels={'itemDescription': 'Items', 'count': 'Nº Purchases'},
text='count')
fig.update_layout(xaxis_tickangle=-45)
iplot(fig)
Thus, there is a preference for purchasing products related to breakfast or sweets (milk, rolls/buns, yoghurt, tropical fruit, pastry, etc.). The distribution of products resembles a log-normal distribution: high initial frequencies followed by a steady decline that produces a long right tail.
Another important insight is to identify seasonal trend products.
# Group by month and product
df_seasonal = df.copy()
df_seasonal['Month'] = df_seasonal['Date'].dt.month
monthly_counts = (
df_seasonal.groupby(['itemDescription', 'Month'])
.size()
.reset_index(name='count')
)
# Calc perc & filter
monthly_counts['total_item'] = monthly_counts.groupby('itemDescription')['count'].transform('sum')
monthly_counts['prop'] = monthly_counts['count'] / monthly_counts['total_item']
def filter_seasonal(df, min_annual_freq=20, min_count_per_month=15, min_prop=0.115):
    return df[(df['total_item'] >= min_annual_freq) &
              (df['count'] >= min_count_per_month) &
              (df['prop'] >= min_prop)].reset_index(drop=True)
seasonal = filter_seasonal(monthly_counts)
seasonal = seasonal.sort_values('Month')
print("Nº products:", seasonal.shape[0])
print("Does any product appear in 2 months or more? ->", seasonal['itemDescription'].duplicated().any())
print("Repeated products:")
for item in seasonal[seasonal['itemDescription'].duplicated()]['itemDescription'].unique():
    print(f"- {item:<15} | Months ->", seasonal[seasonal['itemDescription'] == item]['Month'].values)
Nº products: 32
Does any product appear in 2 months or more? -> True
Repeated products:
- dishes          | Months -> [1 5]
- cat food        | Months -> [1 6]
- white wine      | Months -> [7 8]
- chewing gum     | Months -> [6 8]
- hard cheese     | Months -> [4 8]
- flour           | Months -> [5 9]
- misc. beverages | Months -> [ 4 11]
Under the minimum criteria used to identify a product as seasonal (change the parameters as desired), 32 instances have been identified. For a better understanding, see the graph below.
Please note that the Y axis does not imply cumulative probability.
import plotly.express as px
fig = px.bar(
seasonal,
x='Month', y='prop', color='itemDescription',
text='itemDescription',
title="Top seasonal products | Figure 3",
labels={'Month': 'Month',
'prop': 'Sales Proportion',
'itemDescription': 'Product',
'count': 'Sold this month'},
hover_data={'count': True, 'prop': ':.2%'}
)
fig.update_traces(textposition='outside', textfont_size=12, cliponaxis=False)
fig.update_layout(xaxis_title="Month", yaxis_title="Annual Proportion of Sales by Product")
iplot(fig)
These 32 product-month instances exceed a purchase proportion of 11.5%, whereas a uniform distribution would yield 8.3% per month. Although the proportions are averaged over the two years as a whole and some instances may appear by chance (like detergent), the following products are noteworthy.
Table 1
| Product | Month | Prop (%) | Frequency |
|---|---|---|---|
| Dishes | January | 14.0% | 19 |
| Hamburger meat | January | 12.2% | 40 |
| Frozen meals | January | 13.9% | 35 |
| Flour | May | 13.7% | 20 |
| Cat food | January & June | 13.0%, 14.1% | 23, 25 |
| White wine | July & August | 11.9%, 11.9% | 21, 21 |
| Cream cheese | August | 13.4% | 48 |
| Packaged or pickled fruit/vegetables | August | 13.3% + 14.2% | 17 + 19 |
| Hard cheese | August | 13.2% | 29 |
| Semi-finished bread | October | 13.4% | 19 |
| Canned fish | November | 14.7% | 17 |
| Misc. beverages | November | 12.2% | 29 |
From these findings, it can be concluded that:
- Customers purchase cat food every 5-7 months. The supermarket can take advantage of this by creating a pull effect. For example, by lowering the price of these products to attract visitors to the shop.
- White wine is the only product that remains popular for two consecutive months. This can be exploited to offer a wider variety of these wines during July and August.
- These and other strategies, informed by sales proportion and frequency, can be used to increase profits. For instance, since many hamburgers are purchased in January (40), it may be possible to raise the price by 2%.
In any case, trends may vary from year to year. A good way to identify local trends is to conduct a more detailed survey over a two-week period.
# Same but biweekly, considering the first 15 days or later
df_seasonal = df.copy()
df_seasonal['YearMonth'] = df_seasonal['Date'].dt.to_period('M')
df_seasonal['Half'] = df_seasonal['Date'].dt.day.map(lambda x: 1 if x <= 15 else 2)
df_seasonal['Period'] = df_seasonal['YearMonth'].astype(str) + "-Q" + df_seasonal['Half'].astype(str)
biweekly_counts = (
df_seasonal.groupby(['itemDescription', 'Period'])
.size()
.reset_index(name='count')
)
biweekly_counts['total_item'] = biweekly_counts.groupby('itemDescription')['count'].transform('sum')
biweekly_counts['prop'] = biweekly_counts['count'] / biweekly_counts['total_item']
biweekly = filter_seasonal(biweekly_counts, min_annual_freq=20, min_count_per_month=5, min_prop=0.055)
biweekly = biweekly.sort_values('Period')
print("Nº products:", biweekly.shape[0])
print("Is there any product that appears in two or more biweekly periods? ->", biweekly['itemDescription'].duplicated().any())
print("Repeated products:")
for item in biweekly[biweekly['itemDescription'].duplicated()]['itemDescription'].unique():
    print(f"- {item:<25} | Dates ->", biweekly[biweekly['itemDescription'] == item]['Period'].values)
Nº products: 25
Is there any product that appears in two or more biweekly periods? -> True
Repeated products:
- salt                      | Dates -> ['2014-07-Q2' '2014-12-Q2']
- turkey                    | Dates -> ['2015-03-Q2' '2015-11-Q1' '2015-12-Q2']
- finished products         | Dates -> ['2015-06-Q2' '2015-12-Q2']
import plotly.express as px
fig = px.bar(
biweekly,
x='Period', y='prop', color='itemDescription',
text='itemDescription',
title="Top seasonal products (biweekly) | Figure 4",
category_orders={'Period': list(biweekly['Period'])},
labels={'Period': 'Period',
'prop': 'Sales Proportion',
'itemDescription': 'Product',
'count': 'Sold 2-week Period'},
hover_data={'count': True, 'prop': ':.2%'}
)
fig.update_traces(textposition='outside', textfont_size=12)
fig.update_layout(xaxis_title="Period", yaxis_title="Annual Proportion of Sales by Product")
iplot(fig)
In this case, the uniform probability (non-trending) would be 4.2%, with the products shown exceeding 5.5%.
This more granular version shows few sales peaks per product and no repetitive or sustained trend across the years. The most notable observation is the surge in turkey sales between November and December 2015, which, despite amounting to only 12 purchases, accounted for 15% of all turkey sales over the two-year period.
2.2 Analysis of purchases by customer¶
See the recurrence of customer purchases.
unique_purchases = df[['Member_number', 'Date']].drop_duplicates()
purchase_counts = unique_purchases['Member_number'].value_counts().reset_index()
purchase_counts.columns = ['Member_number', 'purchase_day']
purchase_counts = purchase_counts['purchase_day'].value_counts().sort_index().reset_index()
purchase_counts.columns = ['purchase_day', 'num_clients']
purchase_counts["perc"] = ((purchase_counts["num_clients"] / purchase_counts["num_clients"].sum() *
100).round(2)).astype(str) + "%"
fig = px.bar(purchase_counts, x='purchase_day', y='num_clients',
title='Customers and Number of Days they Purchased | Figure 5',
text='perc',
labels={'purchase_day': 'Days',
'num_clients': 'Number of clients',
'perc': 'Percentage of clients'})
fig.update_layout(xaxis_type='category')
iplot(fig)
Hence, it is evident that only ~9% of customers purchased on a single day (and did not return), while purchasing on 2, 3 or 4 days is most common. Beyond that, the frequency drops sharply.
Given that the date range covers two years, customer loyalty appears low: most customers return only a handful of times. For a deeper understanding of customer retention, a cohort chart is proposed.
df2 = df.copy()
# Date conversion
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
df2['Date'] = df2['Date'].dt.to_period('M').dt.to_timestamp()
# Cohort as first date and time since first purchase
df2 = df2.merge(df2.groupby('Member_number', as_index=False).agg(Cohort=('Date', 'min')), on='Member_number')
df2['Month'] = (df2['Date'].dt.to_period('M').astype(dtype="int64") -
df2['Cohort'].dt.to_period('M').astype(dtype="int64"))
df_cohort = (df2.groupby(['Cohort', 'Month'])['Member_number'].nunique().unstack('Month'))
# Conversion to percentage & plot
for c in df_cohort.columns[1:]:
    df_cohort[c] = df_cohort[c] / df_cohort[0]
df_cohort[0] = 1
fig = px.imshow(
df_cohort,
text_auto=".2%",
color_continuous_scale="darkmint",
range_color=[0, 0.7],
title='Cohort Analysis (Monthly) | Figure 6',
)
fig.update_xaxes(side="top", dtick=1)
fig.update_yaxes(dtick="M1")
iplot(fig)
# This code is an adaptation of the user's response: Rob Raymond
# Source: https://stackoverflow.com/questions/71796692/cohort-chart-using-plotly-library
To interpret this chart, it is important to understand that each row represents a cohort of customers who made their first purchase in a given month (e.g., Jan 2014). The first column is always 100%, because it represents all customers in that cohort. The following columns show the percentage of those same customers who returned in subsequent months. Darker cells indicate higher retention, lighter cells lower retention.
Key findings:
January 2014: The average retention rate is around 15%. However, in months 9–10 (October–November 2014), despite being months with fewer total purchases (see Figure 1), the retention rate is unusually high (~19%). This suggests that purchases in this period were largely made by returning customers rather than new ones.
May and June 2015: These cohorts stand out for very high retention during the first three months, followed by a sharp decline afterward. Interestingly, August 2015 also records the highest overall number of purchases, so these results merit deeper investigation (e.g., promotions, store-level changes, etc.).
Minimums: Some cohorts show abnormally low retention. For example, customers acquired in March 2015 returned at only ~5% in the following month, and those from August 2015 only ~3% in October. Since August 2014 did not show such a sharp decline, the seasonal effect of holidays alone does not explain this pattern.
Averages: On average, March 2015 has the lowest retention rate (~9.26%), while April 2015 has the highest (~15.41%). Overall, the global average retention is 13.45%, with most values falling between 11% and 18%. Therefore, values outside this range deserve special attention from the supermarket to investigate possible causes (e.g., marketing strategies, promotions, advertising, or operational issues).
Additional observations:
- Long-term retention is visible in some cohorts (e.g., January 2014 still retains more than 10% of customers after 23 months), indicating the presence of a loyal core customer base.
- Retention rates vary significantly across cohorts, but there is no clear upward or downward trend over time, which suggests that external or tactical factors (such as campaigns or promotions) may explain the observed differences.
# Used to calc avgs
# avgs = []
# for i in range(len(df_cohort)):
# list_i = df_cohort.iloc[i].tolist()
# list_i.pop(0) # do not include 100%
# list_i = [x for x in list_i if pd.notnull(x)]
# list_i = [x * 100 for x in list_i]
# if len(list_i) != 0:
# avg_list_i = sum(list_i) / len(list_i)
# avgs.append(avg_list_i)
# print(i, avg_list_i)
#
# mean_avgs = sum(avgs) / len(avgs)
# print(f"Global avg: {mean_avgs}")
In order to easily recognise which factors have had the most decisive influence on sales figures, a detailed analysis of anomalous periods is proposed.
import plotly.graph_objects as go
# Two lines: Nº item purchases & daily customer purchases (with multiple items).
daily_purchases = df.groupby(df['Date'].dt.date).size().reset_index(name='num_purchases')
daily_unique_purchases = unique_purchases.groupby(unique_purchases['Date'].dt.date).size().reset_index(name='num_purchases')
# IQRs
Q1 = daily_purchases['num_purchases'].quantile(0.25)
Q3 = daily_purchases['num_purchases'].quantile(0.75)
IQR = Q3 - Q1
Q1_u = daily_unique_purchases['num_purchases'].quantile(0.25)
Q3_u = daily_unique_purchases['num_purchases'].quantile(0.75)
IQR_u = Q3_u - Q1_u
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
lower_bound_u = Q1_u - 1.5 * IQR_u
upper_bound_u = Q3_u + 1.5 * IQR_u
# Save to plot
daily_purchases['is_outlier'] = (
(daily_purchases['num_purchases'] < lower_bound) |
(daily_purchases['num_purchases'] > upper_bound)
)
daily_unique_purchases['is_outlier'] = (
(daily_unique_purchases['num_purchases'] < lower_bound_u) |
(daily_unique_purchases['num_purchases'] > upper_bound_u)
)
# Plot
fig = go.Figure()
# Line 1 & outliers 1
fig.add_trace(go.Scatter(x=daily_purchases['Date'], y=daily_purchases['num_purchases'],
name='Daily products sold',
line=dict(color='dodgerblue', width=2),
mode='lines'))
fig.add_scatter(
x=daily_purchases[daily_purchases['is_outlier']]['Date'],
y=daily_purchases[daily_purchases['is_outlier']]['num_purchases'],
mode='markers',
marker=dict(size=8, color='red', symbol='x'),
name='Outliers Daily products sold')
# Line 2 & outliers 2
fig.add_scatter(
x=daily_unique_purchases['Date'],
y=daily_unique_purchases['num_purchases'],
mode='lines',
name='Daily customers',
line=dict(color='lightseagreen', width=2))
fig.add_scatter(
x=daily_unique_purchases[daily_unique_purchases['is_outlier']]['Date'],
y=daily_unique_purchases[daily_unique_purchases['is_outlier']]['num_purchases'],
mode='markers',
marker=dict(size=8, color='red', symbol='square-cross'),
name='Outliers Daily customers')
# Name axis & IQR lines
fig.update_layout(yaxis_title='Sales', xaxis_title='Date')
fig.update_layout(title='Daily Sales by Item & Customer | Figure 7')
fig.add_hline(y=lower_bound, line_color='dodgerblue', line_dash='dash')
fig.add_hline(y=upper_bound, line_color='dodgerblue', line_dash='dash')
fig.add_hline(y=lower_bound_u, line_color='lightseagreen', line_dash='dash')
fig.add_hline(y=upper_bound_u, line_color='lightseagreen', line_dash='dash')
# Include a slider as it is a very long period
# Source: https://plotly.com/python/time-series/ | section: Time Series With Range Slider
fig.update_xaxes(rangeslider_visible=True, range=['2014-01-01', '2014-12-31'])
iplot(fig)
To visualize specific periods, please use the slider or zoom as a filter.
Findings
Item purchases:
- In 2014, daily purchases hardly ever exceeded 80 items (2 cases), while in 2015 this happened on 21 occasions. Regarding the minimum values, this trend is also present but much less pronounced.
- No minimum value qualifies as an outlier, whereas all five values beyond the upper interquartile bound occurred in 2015.
- A weekly-level analysis shows that there is no sustained trend by day of the week. For example, April 4th was the day with the most purchases in the month, and it was a Thursday. Another monthly peak is May 16th, 2015, which was a Saturday, etc.
- Anomalous periods:
- 2014, February 14 – March 4: irregular period but with many item sales.
- 2014, April 1–3: unusually high number of sales.
- 2014, August 28: a peak close to the upper IQR, followed by high sales between September 1–7.
- 2015, January 21: an outlier followed by a good sales period until the 28th.
- 2015, March 16–17: almost no items sold.
- 2015, from April 30 onwards: an unprecedented number of positive peaks; despite a few bad sales days, this is by far the best period.
Purchases per customer:
- Outliers also occur only in the upper IQR. Interestingly, all of them are from 2014.
- In an annual comparison, 2015 shows many more lower-end peaks.
- Anomalous periods:
- 2014, April 3 and 13: many customers, although the rest of the month was normal.
- 2014, May 30: isolated outlier.
- 2014, August 28: very notable outlier, followed by surprisingly good days between September 2–7.
- 2014, October 2–10: very few customers, 18 and 21.
- 2014, October 27 and November 5: outliers; in the following days there are peaks close to +IQR.
- 2014, December 10–19: no outliers, but a period with very good sales.
- 2015, January 1–10: very few customers making purchases.
- 2015, January 21: close to the upper IQR, isolated date.
- 2015, August 13–21: while there are no peaks, it is a sustained and positive sales period.
- 2015, December 9–22: fewer than 20 customers throughout.
To gain a deeper understanding of customer behavior, the diversity of their purchase history will be examined using Shannon entropy.
import numpy as np
# Calc p
client_item_counts = df.groupby(['Member_number', 'itemDescription']).size().reset_index(name='item_count')
client_item_counts['total_client'] = client_item_counts.groupby('Member_number')['item_count'].transform('sum')
client_item_counts['p'] = client_item_counts['item_count'] / client_item_counts['total_client']
# Formula
# log2 used, see: "If the base 2 is used the resulting units may be called binary digits, or more briefly bits,[...]"
# source: SHANNON, A Mathematical Theory of Communication. October, 1948 https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
client_item_counts['log'] = client_item_counts['p'] * np.log2(client_item_counts['p'])
entropy_df = client_item_counts.groupby('Member_number')['log'].sum().reset_index()
entropy_df['entropy'] = -entropy_df['log']
entropy_df = entropy_df.drop(columns='log')
# Top 10 most diverse clients
top10_entropy = entropy_df.sort_values(by='entropy', ascending=False).head(10)
top10_entropy
| | Member_number | entropy |
|---|---|---|
| 369 | 1379 | 4.664498 |
| 3774 | 4875 | 4.582119 |
| 1026 | 2051 | 4.559546 |
| 400 | 1410 | 4.532665 |
| 2244 | 3308 | 4.494680 |
| 1388 | 2433 | 4.478232 |
| 1994 | 3050 | 4.476064 |
| 2665 | 3737 | 4.476064 |
| 2396 | 3465 | 4.459432 |
| 2296 | 3361 | 4.459432 |
fig = px.scatter(entropy_df, x='Member_number', y='entropy',
title='Entropy of Purchase History by Client | Figure 8',
labels={'Member_number': 'Client ID', 'entropy': 'Entropy'})
fig.update_layout(xaxis_title='Client ID', yaxis_title='Entropy')
iplot(fig)
To understand this graph, a brief reminder:
- Shannon entropy is calculated as: $H = -\sum_i p_i \log_2(p_i)$
- Where H is the entropy, measuring the uncertainty/diversity in the distribution of purchases.
- p_i is the probability of purchasing item i (the proportion of times the customer purchased that item relative to their total history).
Therefore, it can be interpreted:
- The higher the entropy, the more evenly distributed purchases are across different items.
- The lower the entropy, the more concentrated purchases are on a single product.
- H = 0 → The client always buys the same item (no diversity).
- H = 1 → Equivalent to splitting purchases evenly between 2 products.
- H ~ 1.4 → It could be the purchase of three products, distributed: [60%, 20%, 20%].
- H = 2 → Equivalent to splitting purchases evenly among 4 products.
- H = 3 → Equivalent to ~8 products evenly distributed.
- H = 3.5 → Equivalent to ~11 products evenly distributed.
- H = 4 → Equivalent to ~16 products evenly distributed.
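The reference points above can be verified with a small helper (a minimal sketch, separate from the notebook's entropy computation, assuming NumPy as imported earlier):

```python
import numpy as np

def shannon_entropy(probs):
    """H = -sum(p_i * log2(p_i)), in bits."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # by convention, 0 * log2(0) = 0
    return float(-(p * np.log2(p)).sum())

print(shannon_entropy([1.0]))            # H = 0: always the same item
print(shannon_entropy([0.5, 0.5]))       # H = 1: two items, evenly split
print(shannon_entropy([0.6, 0.2, 0.2]))  # H ≈ 1.37: three items at 60/20/20
print(shannon_entropy([0.25] * 4))       # H = 2: four items, evenly split
print(shannon_entropy([1 / 16] * 16))    # H = 4: sixteen items, evenly split
```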
Findings
- Only 6 clients have H = 0, meaning they always buy the same product.
- Many clients have H ≈ 1 or H ≈ 2, which means their purchases are concentrated in only 2–4 products.
- A large group of clients show entropy values around H = 3 or higher, which implies a much broader and more balanced variety of products in their baskets (typically 8–12 items).
- Most clients fall between H = 2 and H = 4.5, suggesting that supermarket customers tend to diversify their purchases considerably rather than sticking to only a few products.
Additionally, the absolute distribution of various purchases (unique items) is plotted for greater understanding.
fig = px.histogram(
x=df.groupby('Member_number')['itemDescription'].nunique(),
title='Unique Items Purchased per Client | Figure 9',
labels={'x': 'Unique Items Count', 'y': 'Frequency'}
)
fig.update_layout(xaxis_title='Unique Items Count', yaxis_title='Frequency', bargap=0.1)
iplot(fig)
It is clear from the histogram that most clients tend to buy between 4 and 14 distinct products. As the number of unique items increases, the frequency of clients decreases, meaning only a minority of customers (just over 300) purchase more than 15 different products.
By combining this histogram with the previous entropy analysis (Figure 8), it can be concluded that customers who purchase more than 15 different products usually distribute their purchases quite evenly across them. In other words, they do not tend to over-purchase a single item compared to the others. This conclusion follows from the fact that the Shannon entropy reaches its maximum when purchases are spread equally across all products: $H_{\text{max}} = \log_2(n)$
For example, if a client buys 25 products, each with an equal probability of 4%, the entropy would reach H ≈ 4.6, which matches the maximum values observed above.
print(np.log2(25))
print((1/25)*100)
4.643856189774724 4.0
3. Creating smart offers¶
So far, the analysis has focused on a more descriptive and surface-level exploration of the data. However, to truly uncover deeper patterns hidden within supermarket sales, it is necessary to move beyond simple summaries. This is where the Apriori algorithm comes into play.
Theoretical understanding
At the core of the algorithm are two concepts: transactions, which represent individual purchases (shopping baskets), and items, which are the products that make up each transaction. In other words, each transaction is a set of items bought together during a shopping event.
Several metrics are used by the algorithm, either during its execution or afterwards when evaluating the generated rules. Among them, support is crucial during the mining process, while the others become more relevant for the analysis stage:
$ \text{Support}(X) = \frac{\text{transactionsWithX}}{\text{totalTransactions}} $
- Example: if Coca-Cola appears in 50 out of 300 transactions, support(Coca-Cola) = 50/300 ≈ 0.17
$ \text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cap Y)}{\text{Support}(X)} $
- Example: if Coca-Cola and snacks are purchased together 25 times, and Coca-Cola appears in 50 transactions, confidence = 25/50 = 0.5
$ \text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)} $
- Lift measures how much more likely X and Y are to occur together compared to what would be expected if they were independent.
- Interpretation:
- Lift > 1 → X and Y appear together more often than expected (positive association).
- Lift ≈ 1 → independence.
- Lift < 1 → X and Y appear together less often than expected (negative association).
Formulas and examples adapted from IBM [1]
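As a sanity check, the three metrics can be computed by hand on a handful of toy baskets (the items and counts below are made up for illustration, not drawn from the dataset):

```python
# Hypothetical toy baskets to verify support, confidence and lift
transactions = [
    {'coca-cola', 'snacks'},
    {'coca-cola', 'snacks', 'bread'},
    {'coca-cola', 'milk'},
    {'bread', 'milk'},
    {'snacks', 'milk'},
    {'bread'},
]
n = len(transactions)

def support(*items):
    """Fraction of transactions containing all the given items."""
    s = set(items)
    return sum(s <= t for t in transactions) / n

sup_x = support('coca-cola')               # 3/6 = 0.5
sup_y = support('snacks')                  # 3/6 = 0.5
sup_xy = support('coca-cola', 'snacks')    # 2/6 ≈ 0.33
confidence = sup_xy / sup_x                # ≈ 0.67
lift = confidence / sup_y                  # ≈ 1.33 -> lift > 1, positive association
print(round(confidence, 2), round(lift, 2))
```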
The Apriori algorithm is based on an iterative approach:
- Start with frequent itemsets of size 1 (single products).
- Generate candidate itemsets of size 2 from the frequent ones of size 1.
- Keep only those that meet the minimum support threshold.
- Repeat the process for itemsets of size 3, 4, and so on.
- The process stops when no larger itemsets can be formed.
Therefore, if an itemset is infrequent, all of its supersets will also be infrequent.
[1] Joshua Noble. What is the Apriori algorithm? [online]. 9 June 2024. [Accessed 21 April 2025]. Available at: https://www.ibm.com/think/topics/apriori-algorithm
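The level-wise procedure above can be sketched in plain Python. This is a simplified illustration of the idea (not the mlxtend implementation used below), run on hypothetical baskets:

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise frequent-itemset mining, following the steps above."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Step 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = sum(s <= t for t in transactions) / n
        k += 1
        # Steps 2-3: candidates of size k from the frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent (anti-monotonicity)
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Step 4: keep only candidates meeting the minimum support
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
    return frequent

# Toy example (made-up baskets, not the dataset)
baskets = [{'milk', 'bread'}, {'milk', 'bread', 'butter'},
           {'milk', 'butter'}, {'bread'}]
result = apriori_sketch(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), round(sup, 2))
```

In this toy run, {bread, butter} falls below the support threshold, so {milk, bread, butter} is never even counted, illustrating the pruning property stated above.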
In order to get the most out of the data, two approaches to this algorithm are proposed.
3.1 First Approach: Customer-Level Transactions¶
In this first approach, each transaction is not a single purchase, but the entire purchase history of a customer.
This implies:
- Products that tend to appear together are identified across a customer’s lifetime, not necessarily in the same shopping trip.
- For instance, if Coca-Cola and snacks are bought on different occasions, the rule will still consider them connected.
- The results are useful to profile customers and understand long-term preferences.
- However, information is lost about which products are actually bought together in the same basket.
data_apriori = df.copy()
data_apriori = data_apriori.groupby(['Member_number', 'itemDescription']).size().unstack(fill_value=0)
data_apriori = (data_apriori > 0)
data_apriori.head(2)
| itemDescription | Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | bags | baking powder | bathroom cleaner | beef | berries | ... | turkey | vinegar | waffles | whipped/sour cream | whisky | white bread | white wine | whole milk | yogurt | zwieback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Member_number | |||||||||||||||||||||
| 1000 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | True | False |
| 1001 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | True | False | True | False | True | False | False |
2 rows × 167 columns
from mlxtend.frequent_patterns import apriori, association_rules
frequent_itemsets = apriori(data_apriori, min_support=0.04, use_colnames=True, max_len=3)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules = rules.sort_values(by='lift', ascending=False).reset_index(drop=True)
rules.head(15)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (sausage, whole milk) | (yogurt) | 0.106978 | 0.282966 | 0.044895 | 0.419664 | 1.483093 | 1.0 | 0.014624 | 1.235551 | 0.364754 | 0.130112 | 0.190644 | 0.289161 |
| 1 | (yogurt) | (sausage, whole milk) | 0.282966 | 0.106978 | 0.044895 | 0.158658 | 1.483093 | 1.0 | 0.014624 | 1.061426 | 0.454279 | 0.130112 | 0.057871 | 0.289161 |
| 2 | (sausage) | (yogurt, whole milk) | 0.206003 | 0.150590 | 0.044895 | 0.217933 | 1.447192 | 1.0 | 0.013873 | 1.086109 | 0.389179 | 0.144033 | 0.079282 | 0.258029 |
| 3 | (yogurt, whole milk) | (sausage) | 0.150590 | 0.206003 | 0.044895 | 0.298126 | 1.447192 | 1.0 | 0.013873 | 1.131253 | 0.363790 | 0.144033 | 0.116024 | 0.258029 |
| 4 | (sausage) | (other vegetables, rolls/buns) | 0.206003 | 0.146742 | 0.041816 | 0.202989 | 1.383305 | 1.0 | 0.011587 | 1.070572 | 0.348986 | 0.134488 | 0.065920 | 0.243977 |
| 5 | (other vegetables, rolls/buns) | (sausage) | 0.146742 | 0.206003 | 0.041816 | 0.284965 | 1.383305 | 1.0 | 0.011587 | 1.110431 | 0.324748 | 0.134488 | 0.099449 | 0.243977 |
| 6 | (whole milk, rolls/buns) | (shopping bags) | 0.178553 | 0.168291 | 0.041303 | 0.231322 | 1.374531 | 1.0 | 0.011254 | 1.081998 | 0.331707 | 0.135181 | 0.075784 | 0.238374 |
| 7 | (shopping bags) | (whole milk, rolls/buns) | 0.168291 | 0.178553 | 0.041303 | 0.245427 | 1.374531 | 1.0 | 0.011254 | 1.088625 | 0.327614 | 0.135181 | 0.081410 | 0.238374 |
| 8 | (bottled water) | (other vegetables, whole milk) | 0.213699 | 0.191380 | 0.056183 | 0.262905 | 1.373732 | 1.0 | 0.015285 | 1.097036 | 0.345995 | 0.161029 | 0.088453 | 0.278235 |
| 9 | (other vegetables, whole milk) | (bottled water) | 0.191380 | 0.213699 | 0.056183 | 0.293566 | 1.373732 | 1.0 | 0.015285 | 1.113056 | 0.336445 | 0.161029 | 0.101572 | 0.278235 |
| 10 | (pip fruit) | (pastry) | 0.170600 | 0.177527 | 0.041047 | 0.240602 | 1.355296 | 1.0 | 0.010761 | 1.083059 | 0.316076 | 0.133668 | 0.076689 | 0.235908 |
| 11 | (pastry) | (pip fruit) | 0.177527 | 0.170600 | 0.041047 | 0.231214 | 1.355296 | 1.0 | 0.010761 | 1.078843 | 0.318738 | 0.133668 | 0.073081 | 0.235908 |
| 12 | (other vegetables) | (sausage, rolls/buns) | 0.376603 | 0.082350 | 0.041816 | 0.111035 | 1.348337 | 1.0 | 0.010803 | 1.032268 | 0.414416 | 0.100246 | 0.031260 | 0.309412 |
| 13 | (sausage, rolls/buns) | (other vegetables) | 0.082350 | 0.376603 | 0.041816 | 0.507788 | 1.348337 | 1.0 | 0.010803 | 1.266521 | 0.281529 | 0.100246 | 0.210435 | 0.309412 |
| 14 | (shopping bags) | (other vegetables, whole milk) | 0.168291 | 0.191380 | 0.042842 | 0.254573 | 1.330196 | 1.0 | 0.010635 | 1.084774 | 0.298459 | 0.135223 | 0.078149 | 0.239217 |
Results
- Rule 1: about 4.5% of customers (support) buy sausages and whole milk, and among them, 42% also buy yogurt (confidence), with a positive, non-independent association (lift = 1.48).
- Rule 2: in the opposite direction, when yogurt is purchased, only 16% of cases also include sausages and whole milk. The confidence is lower due to yogurt’s higher frequency (antecedent support = 0.28), though the lift remains 1.48. Zhang’s metric is medium-high (0.45), reinforcing the positive link. (See zhangs_metric)
- Rules 3 and 4: both highlight the relation among the same three products (sausages, whole milk, and yogurt). With supports of ~4.5% and lift = 1.45, the association is strong. Confidences vary (22% vs 30%), but overall they show a clear three-item connection.
- Rule 5: about 4.2% of customers buy rolls/buns and other vegetables. In 28% of these cases, sausages are also purchased at some point. With lift = 1.38 and Zhang’s metric = 0.32, this is unlikely to be random. This may reflect a meal/snack pattern.
- Rule 6: when sausages are bought, in 20% of cases rolls/buns and other vegetables will also be added. The lift is again 1.38, supporting the pattern in Rule 5, although with lower confidence.
- Rule 7: about 17.8% of customers buy rolls/buns with whole milk, and in 23% of these cases shopping bags are added. With lift = 1.37, this points to a positive link, possibly tied to larger grocery trips.
- Rule 8: from the opposite side, when shopping bags are purchased at any time, in 24% of cases rolls/buns and whole milk are included. The association remains with the same lift value.
- Rule 9: among customers who buy bottled water, in 26% of cases they also buy vegetables and whole milk at some point. With lift = 1.37 and Zhang’s metric = 0.35, this suggests a health-conscious customer profile.
- Rule 10: conversely, among customers who buy other vegetables and whole milk, 29% also tend to buy bottled water. This confirms the association, with identical lift.
- Rule 11: about 17% of customers buy pip fruit (e.g., apples, pears), and 24% of them also buy pastry. With lift = 1.35, this indicates a customer segment mixing healthy and indulgent choices.
- Rule 12: from the other direction, among customers who buy pastry, 23% also buy pip fruit. Again, the lift = 1.35, showing a complementary pattern within customer habits.
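Several of the interpretations above lean on Zhang's metric. As a sanity check, it can be recomputed from the three support columns alone. This is a minimal sketch using one common formulation of the metric, which reproduces the values in the table (the input numbers are taken from the bottled water → other vegetables, whole milk row):

```python
def zhangs_metric(supp_a, supp_b, supp_ab):
    """Zhang's metric: -1 (perfect dissociation) to +1 (perfect association).

    supp_a / supp_b are the antecedent / consequent supports;
    supp_ab is the support of the combined itemset.
    """
    confidence = supp_ab / supp_a
    numerator = confidence - supp_b
    denominator = max(confidence * (1 - supp_a), supp_b - supp_ab)
    return numerator / denominator

# Values from the (bottled water) -> (other vegetables, whole milk) rule above
print(round(zhangs_metric(0.213699, 0.191380, 0.056183), 2))  # ~0.35, matching the table
```

The same call with the pip fruit → pastry supports (0.1706, 0.177527, 0.041047) reproduces that rule's 0.316 as well, which is a quick way to verify the table is internally consistent.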
Strategic Insights & Recommendations (Customer Profiles)
Sausages, whole milk, and yogurt
- These items tend to co-occur in the same customer profiles, with sausages and whole milk acting as the foundation.
- Recommendation: target customers who consistently buy sausages and whole milk with personalized offers on yogurt. This leverages their existing buying pattern and encourages them to expand their basket over time.
Rolls/buns, vegetables, and sausages
- This combination suggests a profile oriented toward quick meals or snacks.
- Recommendation: segment these customers and create meal bundle campaigns (sausages + buns + vegetables). Personalized coupons or targeted promotions could reinforce their natural tendency.
Rolls/buns and whole milk with shopping bags
- Customers in this group appear to make larger shopping trips, although buying bags may simply reflect heavy or bulky items rather than big baskets.
- Recommendation: since these baskets still contain relatively few distinct items, reinforce loyalty by expanding the variety of, and the offers on, rolls/buns and milk.
Whole milk, vegetables, and bottled water
- This profile reflects a health-conscious customer segment.
- Recommendation: design targeted promotions highlighting health, such as discounts on fruit + water packs etc., or loyalty campaigns centered on healthy lifestyle baskets.
Pip fruit and pastry
- These customers show a dual behavior: mixing healthy choices with indulgent treats.
- Recommendation: promote this contrast with cross-category deals (e.g., "fresh & sweet" combos). This supports both parts of their behavior while increasing cross-selling opportunities.
In short, several long-term profiles of low-frequency customers have been identified, making it possible to understand their preferences over time and to propose strategies that maximise their return.
The next cell also shows that the algorithm's hyperparameters could hardly have been chosen otherwise: with a more conventional minimum support, no rules survive even a permissive lift threshold.
# min_support raised to 0.1: no rules survive, even with a lenient lift threshold of 1.2
# (barely above independence) and no limit on itemset size
frequent_itemsets_test = apriori(data_apriori, min_support=0.1, use_colnames=True, max_len=None)
rules_test = association_rules(frequent_itemsets_test, metric="lift", min_threshold=1.2)
rules_test = rules_test.sort_values(by='lift', ascending=False)
rules_test.head()
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(The table is empty: no rules meet these thresholds.)
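The same collapse can be made concrete on a toy one-hot matrix: as `min_support` rises, the number of surviving pairwise rules drops to zero. This is a self-contained sketch with invented baskets (a stand-in for `data_apriori`), using a naive pairwise count instead of mlxtend so the effect is visible at a glance:

```python
import itertools
import pandas as pd

# Invented one-hot basket matrix (column names are hypothetical stand-ins)
baskets = pd.DataFrame({
    'whole milk': [1, 1, 1, 0, 1, 1, 0, 1],
    'yogurt':     [1, 0, 1, 0, 1, 0, 0, 1],
    'sausage':    [1, 0, 0, 0, 1, 0, 0, 0],
}, dtype=bool)

def count_pair_rules(df, min_support, min_lift):
    """Count directed A -> B pairs whose support and lift clear both thresholds."""
    n = len(df)
    rules = 0
    for a, b in itertools.permutations(df.columns, 2):
        supp_ab = (df[a] & df[b]).sum() / n
        if supp_ab >= min_support and supp_ab / (df[a].mean() * df[b].mean()) >= min_lift:
            rules += 1
    return rules

for ms in (0.1, 0.3, 0.6):
    print(ms, count_pair_rules(baskets, min_support=ms, min_lift=1.2))
# rule count drops: 6, then 2, then 0
```

The pattern mirrors the real dataset: with 167 items spread thinly across baskets, a minimum support of 0.1 is simply unreachable for multi-item sets, so the low thresholds used earlier were a necessity, not a choice.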
3.2 Second Approach: Basket-Level Transactions¶
In this second approach, each transaction corresponds to a single shopping basket, i.e. one purchase per customer visit (defined by the customer and the date).
This implies:
- Products are linked only if they are purchased together in the same basket.
- For example, Coca-Cola and snacks will be associated only if they appear in the same transaction.
- Patterns are built across all baskets in the dataset, even if they span months or years.
- The focus is less on the customer profile and more on what typically goes together within each basket.
data_apriori2 = df.copy()
# Basket id: one transaction per customer visit (Member_number + Date)
data_apriori2['shopping_list'] = data_apriori2['Member_number'].astype(str) + "_" + data_apriori2['Date'].dt.date.astype(str)
data_apriori2['purchase'] = 1
# Collapse repeated purchases of the same item within a basket, then one-hot encode
data_apriori2 = data_apriori2.drop_duplicates(subset=['shopping_list', 'itemDescription'])
data_apriori2 = data_apriori2.pivot(index='shopping_list', columns='itemDescription', values='purchase').fillna(0).astype(bool)
data_apriori2.head(2)
| itemDescription | Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | bags | baking powder | bathroom cleaner | beef | berries | ... | turkey | vinegar | waffles | whipped/sour cream | whisky | white bread | white wine | whole milk | yogurt | zwieback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| shopping_list | |||||||||||||||||||||
| 1000_2014-06-24 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1000_2015-03-15 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | True | False |
2 rows × 167 columns
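As a side note, the `drop_duplicates` + `pivot` sequence above can be collapsed into a single `pd.crosstab` call: crosstab counts item occurrences per basket, and the boolean cast then flattens any within-basket repeats. A minimal sketch on a hypothetical four-row slice of the log:

```python
import pandas as pd

# Hypothetical slice of the raw log: note 'whole milk' repeated within one visit
df_toy = pd.DataFrame({
    'Member_number': [1000, 1000, 1000, 2000],
    'Date': ['2015-03-15', '2015-03-15', '2015-03-15', '2015-03-15'],
    'itemDescription': ['whole milk', 'yogurt', 'whole milk', 'sausage'],
})

# One line: cross-tabulate basket id vs. item, then binarize
basket_id = df_toy['Member_number'].astype(str) + '_' + df_toy['Date']
baskets = pd.crosstab(basket_id, df_toy['itemDescription']).astype(bool)
print(baskets.shape)  # (2, 3): two baskets, three distinct items
```

Both routes produce the same boolean matrix; the explicit `drop_duplicates` version used above simply makes the deduplication step visible.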
frequent_itemsets2 = apriori(data_apriori2, min_support=0.002, use_colnames=True, max_len=None)
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1.1)
rules2 = rules2.sort_values(by='lift', ascending=False).reset_index(drop=True)
rules2.head(9)
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (curd) | (sausage) | 0.033683 | 0.060349 | 0.002941 | 0.087302 | 1.446615 | 1.0 | 0.000908 | 1.029531 | 0.319493 | 0.032282 | 0.028684 | 0.068014 |
| 1 | (sausage) | (curd) | 0.060349 | 0.033683 | 0.002941 | 0.048726 | 1.446615 | 1.0 | 0.000908 | 1.015814 | 0.328559 | 0.032282 | 0.015568 | 0.068014 |
| 2 | (brown bread) | (canned beer) | 0.037626 | 0.046916 | 0.002406 | 0.063943 | 1.362937 | 1.0 | 0.000641 | 1.018191 | 0.276701 | 0.029292 | 0.017866 | 0.057613 |
| 3 | (canned beer) | (brown bread) | 0.046916 | 0.037626 | 0.002406 | 0.051282 | 1.362937 | 1.0 | 0.000641 | 1.014394 | 0.279398 | 0.029292 | 0.014190 | 0.057613 |
| 4 | (frozen vegetables) | (sausage) | 0.028002 | 0.060349 | 0.002072 | 0.073986 | 1.225966 | 1.0 | 0.000382 | 1.014726 | 0.189627 | 0.024012 | 0.014513 | 0.054158 |
| 5 | (sausage) | (frozen vegetables) | 0.060349 | 0.028002 | 0.002072 | 0.034330 | 1.225966 | 1.0 | 0.000382 | 1.006553 | 0.196155 | 0.024012 | 0.006510 | 0.054158 |
| 6 | (sausage) | (bottled beer) | 0.060349 | 0.045312 | 0.003342 | 0.055371 | 1.222000 | 1.0 | 0.000607 | 1.010649 | 0.193337 | 0.032658 | 0.010537 | 0.064559 |
| 7 | (bottled beer) | (sausage) | 0.045312 | 0.060349 | 0.003342 | 0.073746 | 1.222000 | 1.0 | 0.000607 | 1.014464 | 0.190292 | 0.032658 | 0.014258 | 0.064559 |
| 8 | (frankfurter) | (other vegetables) | 0.037760 | 0.122101 | 0.005146 | 0.136283 | 1.116150 | 1.0 | 0.000536 | 1.016420 | 0.108146 | 0.033261 | 0.016154 | 0.089214 |
Splitting the data into a much larger number of smaller transactions has produced far less pronounced associations. The data barely reflects relationships between products in the same basket; only the first four rules have metrics that indicate any real dependency.
Moderately strong associations (lift > 1.36)
- Rule 1: about 3.4% of transactions include curd, and in 8.7% of those cases sausage is also present. With lift = 1.45, this suggests curd buyers are more likely than average to also pick sausages.
- Rule 2: conversely, among sausage purchases (6% of transactions), only 4.9% also contain curd. The confidence is lower, but the lift remains the same.
- Rule 3: about 3.8% of transactions include brown bread, and in 6.4% of those cases canned beer is also added. With lift = 1.36 and Zhang's metric = 0.28, this combination shows up more often than chance would suggest.
- Rule 4: from the opposite direction, among canned beer purchases (4.7%), 5.1% also include brown bread.
Therefore:
- Only a few notable links emerge at the transaction level, and they are far weaker than the customer-level patterns. Support for the strongest rules stays around 0.3%.
- The curd/sausage and brown bread/canned beer pairs stand out slightly more (lift > 1.3).
- Other associations, such as sausages with frozen vegetables or bottled beer, do appear, but none are strong enough to act on (lift ≤ 1.23, support ≈ 0.2–0.3%, Zhang's metric below 0.2).
Overall, these findings indicate that the transaction-based rules are too infrequent and weak to serve as a reliable foundation for developing generic strategies.
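The gap between the two approaches can be illustrated on a toy purchase log: pooling all of a customer's visits into one "transaction" inflates co-occurrence, so the same item pair scores a higher lift at the customer level than at the basket level. A self-contained sketch with invented data:

```python
import pandas as pd

# Invented purchase log: one row per item bought (mirrors the real df's shape)
log = pd.DataFrame({
    'member': [1, 1, 2, 2, 2, 3],
    'date':   ['d1', 'd2', 'd1', 'd1', 'd2', 'd1'],
    'item':   ['curd', 'sausage', 'curd', 'sausage', 'beer', 'beer'],
})

def lift(onehot, a, b):
    """Lift of the pair (a, b) in a boolean transaction matrix."""
    return (onehot[a] & onehot[b]).mean() / (onehot[a].mean() * onehot[b].mean())

# Customer level: pool all of a member's visits into one transaction
by_member = pd.crosstab(log['member'], log['item']).astype(bool)
# Basket level: one transaction per (member, date) visit
by_basket = pd.crosstab([log['member'], log['date']], log['item']).astype(bool)

print(round(lift(by_member, 'curd', 'sausage'), 4))  # 1.5
print(round(lift(by_basket, 'curd', 'sausage'), 4))  # 1.25
```

This is exactly the behaviour seen above: the customer-level analysis surfaced stronger rules because it aggregates away the sparsity of individual baskets, while the basket-level rules remain weak and infrequent.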
Non-technical Final Report 🛒
Executive Summary: Driving Growth Through Customer Loyalty and Smart Selling¶
This analysis of sales data from 3,898 members reveals a business with strong sales growth but a significant opportunity to improve customer loyalty. Our primary challenge is a low customer retention rate, with the vast majority of customers making fewer than six purchases over a two-year period.
Key Findings:
- Critical Challenge: Customer loyalty is low, indicating a potential weakness in long-term engagement.
- Clear Strengths: The business shows a positive sales trend and possesses a distinct list of "star products" and predictable seasonal sales peaks.
- Untapped Potential: Specific product combinations are frequently purchased together, offering clear opportunities for targeted promotions and store layout optimization to increase basket size.
Top Recommendations:
- Launch a Customer Retention Program: Focus on converting single-purchase customers into repeat shoppers.
- Implement a Data-Driven Marketing Calendar: Align promotions with identified monthly seasonal peaks for key products.
- Develop Targeted Cross-Selling Strategies: Promote product bundles based on identified purchasing patterns (e.g., promote yogurt to customers buying sausages and milk).
1. The Core Challenge: Understanding and Improving Customer Loyalty¶
Our analysis reveals that the most significant opportunity for growth lies in increasing customer retention.
Finding 1: Low Purchase Recurrence¶
Nearly one in ten customers (9%) makes only one purchase and never returns, and the vast majority shop between 2 and 6 times over the two-year span (Figure 5). This pattern signals a weak customer-loyalty loop that must be addressed to ensure sustainable growth.
Finding 2: Cohort Analysis Reveals Key Insights¶
By analyzing customer cohorts (Figure 6), we can pinpoint specific periods of success and failure in retention:
- A Notable Success: May and June 2015 showed exceptionally high retention rates for new customers. Action Point: We must investigate what business activities (promotions, events, etc.) occurred during this period and replicate them.
- A Clear Failure: Conversely, strategies in March 2015 attracted many new customers but failed entirely to re-engage existing ones. This highlights the need for balanced marketing efforts.
- Long-Term Value: A solid core of over 10% of customers remains loyal after 13+ months. This confirms that converting new shoppers into long-term customers yields significant value.
2. Decoding Sales Patterns: What, When, and How Much Customers Buy¶
While loyalty is a challenge, overall sales trends are positive, and clear patterns have emerged.
Opportunity 1: Capitalize on Seasonal Peaks¶
Sales are not uniform throughout the year. We've identified clear high-demand months (for further details, see Table 1):
- January: Prime time for "Dishes," "Hamburger meat," and "Frozen meals."
- August: High demand for "Cream cheese," "Hard cheese," and packaged fruit/vegetables.
- July & August: Strong preference for "White wine".
- November: "Canned fish" and "Misc. beverages" see a surge.
Recommendation: Align marketing campaigns and inventory management with these predictable seasonal trends to maximize sales.
Opportunity 2: Leverage Our "Star Products"¶
A small group of products, led by "whole milk", drives a disproportionately high volume of sales (Figure 2). While popular, these products are not always purchased together, suggesting they serve different customer needs. Promotions should be designed strategically rather than assuming a universal appeal.
Opportunity 3: Identify High-Value Shopping Trips¶
Certain days show anomalous sales peaks (Figure 7). By cross-referencing these dates with business records, we can identify the triggers for large shopping sprees and reproduce those conditions.
3. Strategic Merchandising: Unlocking the Power of the Shopping Cart¶
Analyzing what products are bought together provides a powerful tool for increasing the average transaction value. We identified two types of product relationships:
A. Long-Term Customer Habits (Lifecycle Analysis)¶
These five strong relationships represent strategic opportunities for personalized marketing and product placement:
- The "Full Meal" Profile: Customers buying sausages and whole milk have a 42% chance of also buying yogurt. Action: Bundle these items or place them in proximity.
- The "Quick Meal" Profile: 50% of customers buying sausages and rolls/buns also add other vegetables. Action: Create a "quick dinner" display featuring these products.
- The "Healthy" Profile: Whole milk, vegetables, and bottled water are frequently purchased together, indicating a health-conscious segment.
- The "Snack" Profile: Customers who buy pip fruit are likely (25%) to also buy pastry (and vice versa), with neither product being seasonal. Action: Place these items near each other to encourage impulse buys.
- The "Big Shopper" Profile: Rolls/buns and whole milk are often bought with shopping bags, signaling a large, planned shopping trip.
B. In-the-Moment Purchases (Basket Analysis)¶
While far less frequent (found in only ~0.2–0.3% of transactions), we noted two recurring pairs for smaller, tactical promotions:
- Curd and sausages.
- Brown bread and canned beer.
4. Summary of Strategic Recommendations¶
Based on this analysis, we propose a three-pronged strategy to drive business growth:
Boost Customer Retention:
- Priority: Develop and launch a loyalty program targeting customers after their first purchase.
- Analyze: Investigate the successful tactics from May/June 2015 and re-deploy them.
Increase Purchase Frequency and Value:
- Marketing: Create a promotional calendar based on the monthly product seasonality identified in Table 1.
- Merchandising: Restructure store layouts to place synergistic products like "pip fruit" and "pastry" together.
- Promotions: Design targeted offers and bundles based on the strong product associations found (e.g., the "Full Meal" and "Quick Meal" profiles).
Embrace Data-Driven Decisions:
- Action: Continuously cross-reference sales peaks (Figure 7) and retention spikes (Figure 6) with business actions to build a playbook of what works.
For a detailed breakdown of the methodology and data, please refer to the full technical report.